NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Explanation in the Era of Large Language Models

https://doi.org/10.18653/v1/2024.naacl-tutorials.3

Zhu, Zining; Chen, Hanjie; Ye, Xi; Lyu, Qing; Tan, Chenhao; Marasovic, Ana; Wiegreffe, Sarah (June 2024, Association for Computational Linguistics)

Full Text Available
Faithful Chain-of-Thought Reasoning

Lyu, Qing; Havaldar, Shreya; Stein, Adam; Zhang, Li; Rao, Delip; Wong, Eric; Apidianaki, Marianna; Callison-Burch, Chris (November 2023, The 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL 2023))

While Chain-of-Thought (CoT) prompting boosts Language Models’ (LM) performance on a gamut of complex reasoning tasks, the generated reasoning chain does not necessarily reflect how the model arrives at the answer (aka. faithfulness). We propose Faithful CoT, a reasoning framework involving two stages: Translation (Natural Language query → symbolic reasoning chain) and Problem Solving (reasoning chain → answer), using an LM and a deterministic solver respectively. This guarantees that the reasoning chain provides a faithful explanation of the final answer. Aside from interpretability, Faithful CoT also improves empirical performance: it outperforms standard CoT on 9 of 10 benchmarks from 4 diverse domains, with a relative accuracy gain of 6.3% on Math Word Problems (MWP), 3.4% on Planning, 5.5% on Multi-hop Question Answering (QA), and 21.4% on Relational Inference. Furthermore, with GPT-4 and Codex, it sets the new state-of-the-art few-shot performance on 7 datasets (with 95.0+ accuracy on 6 of them), showing a strong synergy between faithfulness and accuracy.
more » « less
Full Text Available
Explanation-based Finetuning Makes Models More Robust to Spurious Cues

https://doi.org/10.18653/v1/2023.acl-long.242

Ludan, Josh Magnus; Meng, Yixuan; Nguyen, Tai; Shah, Saurabh; Lyu, Qing; Apidianaki, Marianna; Callison-Burch, Chris (January 2023, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics)

Large Language Models (LLMs) are so powerful that they sometimes learn correlations between labels and features that are irrelevant to the task, leading to poor generalization on out-of-distribution data. We propose explanation-based finetuning as a general approach to mitigate LLMs’ reliance on spurious correlations. Unlike standard finetuning where the model only predicts the answer given the input, we finetune the model to additionally generate a free-text explanation supporting its answer. To evaluate our method, we finetune the model on artificially constructed training sets containing different types of spurious cues, and test it on a test set without these cues. Compared to standard finetuning, our method makes GPT-3 (davinci) remarkably more robust against spurious cues in terms of accuracy drop across four classification tasks: ComVE (+1.2), CREAK (+9.1), e-SNLI (+15.4), and SBIC (+6.5). The efficacy generalizes across multiple model families and scales, with greater gains for larger models. Finally, our method also works well with explanations generated by the model, implying its applicability to more datasets without human-written explanations.
more » « less
Full Text Available
Show Me More Details: Discovering Hierarchies of Procedures from Semi-structured Web Data

https://doi.org/10.18653/v1/2022.acl-long.214

Zhou, Shuyan; Zhang, Li; Yang, Yue; Lyu, Qing; Yin, Pengcheng; Callison-Burch, Chris; Neubig, Graham (January 2022, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))

Procedures are inherently hierarchical. To “make videos”, one may need to “purchase a camera”, which in turn may require one to “set a budget”. While such hierarchical knowledge is critical for reasoning about complex procedures, most existing work has treated procedures as shallow structures without modeling the parent-child relation. In this work, we attempt to construct an open-domain hierarchical knowledge-base (KB) of procedures based on wikiHow, a website containing more than 110k instructional articles, each documenting the steps to carry out a complex procedure. To this end, we develop a simple and efficient method that links steps (e.g., “purchase a camera”) in an article to other articles with similar goals (e.g., “how to choose a camera”), recursively constructing the KB. Our method significantly outperforms several strong baselines according to automatic evaluation, human judgment, and application to downstream tasks such as instructional video retrieval.
more » « less
Full Text Available
Goal-Oriented Script Construction

Lyu, Qing; Zhang, Li; Callison-Burch Chris (January 2021, Proceedings of the 14th International Conference on Natural Language Generation)

The knowledge of scripts, common chains of events in stereotypical scenarios, is a valuable asset for task-oriented natural language understanding systems. We propose the Goal-Oriented Script Construction task, where a model produces a sequence of steps to accomplish a given goal. We pilot our task on the first multilingual script learning dataset supporting 18 languages collected from wikiHow, a website containing half a million how-to articles. For baselines, we consider both a generation-based approach using a language model and a retrieval-based approach by first retrieving the relevant steps from a large candidate pool and then ordering them. We show that our task is practical, feasible but challenging for state-of-the-art Transformer models, and that our methods can be readily deployed for various other datasets and domains with decent zero-shot performance.
more » « less
Full Text Available
Visual Goal-Step Inference using wikiHow

https://doi.org/10.18653/v1/2021.emnlp-main.165

Yang, Yue; Panagopoulou, Artemis; Lyu, Qing; Zhang, Li; Yatskar, Mark; Callison-Burch, Chris (January 2021, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing)

Understanding what sequence of steps are needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets like HowTo100m, increasing the VGSI accuracy by 15 - 20%. Our task will facilitate multimodal reasoning about procedural events.
more » « less
Full Text Available
Reasoning about Goals, Steps, and Temporal Ordering with WikiHow

https://doi.org/10.18653/v1/2020.emnlp-main.374

Zhang, Li; Lyu, Qing; Callison-Burch, Chris (January 2020, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing)

We propose a suite of reasoning tasks on two types of relations between procedural events: goal-step relations (“learn poses” is a step in the larger goal of “doing yoga”) and step-step temporal relations (“buy a yoga mat” typically precedes “learn poses”). We introduce a dataset targeting these two relations based on wikiHow, a website of instructional how-to articles. Our human-validated test set serves as a reliable benchmark for common-sense inference, with a gap of about 10% to 20% between the performance of state-of-the-art transformer models and human performance. Our automatically-generated training set allows models to effectively transfer to out-of-domain tasks requiring knowledge of procedural events, with greatly improved performances on SWAG, Snips, and Story Cloze Test in zero- and few-shot settings
more » « less
Full Text Available
Intent Detection with WikiHow

Zhang, Li; Lyu, Qing; Callison-Burch, Chris (January 2020, Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing)

Modern task-oriented dialog systems need to reliably understand users’ intents. Intent detection is even more challenging when moving to new domains or new languages, since there is little annotated data. To address this challenge, we present a suite of pretrained intent detection models which can predict a broad range of intended goals from many actions because they are trained on wikiHow, a comprehensive instructional website. Our models achieve state-of-the-art results on the Snips dataset, the Schema-Guided Dialogue dataset, and all 3 languages of the Facebook multilingual dialog datasets. Our models also demonstrate strong zero- and few-shot performance, reaching over 75% accuracy using only 100 training examples in all datasets.
more » « less
Full Text Available
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Srivastava, Aarohi; Rastogi, Abhinav; Rao, Abhishek; Shoeb, Abu Awal; Abid, Abubakar; Fisch, Adam; Brown, Adam R.; Santoro, Adam; Gupta, Aditya; Garriga-Alonso, Adri; et al (January 2023, Transactions on machine learning research)

Full Text Available

Search for: All records